Group 11

Introduction



1) Load Data

udemy <- read.csv("Data/udemy_courses.csv")
udemy

2) Load Packages

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(here)
## here() starts at C:/Users/palla/Desktop/DAB501/Group_Project/DAB501_Project
library(ggplot2)
library(ggthemes)
library(gganimate)
library(tidyr)
library(dplyr)
library(quantreg)
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
library(gifski)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.0.4
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.0.4
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(vcd)
## Warning: package 'vcd' was built under R version 4.0.4
## Loading required package: grid
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.4
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(treemapify)
## Warning: package 'treemapify' was built under R version 4.0.4
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.0.4
library(viridis)
## Warning: package 'viridis' was built under R version 4.0.4
## Loading required package: viridisLite

3) Head of the Data

head(udemy)

4) Tail of the Data

tail(udemy)

5) Summary of the Data

summary(udemy)
##    course_id       course_title           url             is_paid       
##  Min.   :   8324   Length:3676        Length:3676        Mode :logical  
##  1st Qu.: 407937   Class :character   Class :character   FALSE:308      
##  Median : 688168   Mode  :character   Mode  :character   TRUE :3368     
##  Mean   : 676313                                                        
##  3rd Qu.: 961539                                                        
##  Max.   :1282064                                                        
##      price        num_subscribers     num_reviews     num_lectures   
##  Min.   :  0.00   Min.   :     0.0   Min.   :    1   Min.   :  0.00  
##  1st Qu.: 20.00   1st Qu.:   110.8   1st Qu.:    4   1st Qu.: 15.00  
##  Median : 45.00   Median :   910.0   Median :   18   Median : 25.00  
##  Mean   : 66.09   Mean   :  3081.9   Mean   :  154   Mean   : 40.11  
##  3rd Qu.: 95.00   3rd Qu.:  2534.8   3rd Qu.:   67   3rd Qu.: 46.00  
##  Max.   :200.00   Max.   :121584.0   Max.   :27445   Max.   :779.00  
##     level           content_duration      year        subject         
##  Length:3676        Min.   : 0.000   Min.   :2011   Length:3676       
##  Class :character   1st Qu.: 1.000   1st Qu.:2015   Class :character  
##  Mode  :character   Median : 2.000   Median :2016   Mode  :character  
##                     Mean   : 4.093   Mean   :2015                     
##                     3rd Qu.: 4.500   3rd Qu.:2016                     
##                     Max.   :78.500   Max.   :2017

Pallavi Ravikumar Menon

1. Univariate Analysis

1.1 Numeric Variable

1. Create an appropriate plot to visualize the distribution of this variable. (4 marks)


plot1 <- ggplot(udemy, aes(x=log10(num_reviews + 1))) 

plot1 + geom_histogram(bins = 20,fill = "#00AFBB", alpha = 0.5)+ 
  geom_vline(aes(xintercept= mean(log10(num_reviews + 1))), color= "#0073C2FF", size = 0.8)+
  geom_vline(aes(xintercept= median(log10(num_reviews + 1))), linetype = "dashed", color =  "#FC4E07", size = 0.8)+
  ggtitle("Distribution of Number Of Reviews") + 
  labs(x = "Number of Reviews",  y = "Frequency")+ theme_minimal() + theme(plot.title = element_text (hjust = 0.5)) 


2. Consider any outliers present in the data. If present, specify the criteria used to identify them and provide a logical explanation for how you handled them.(4 marks)

There were outliers in this variable: the maximum number of reviews (27445) is far above the third quartile (67). Rather than removing them, I applied a data transformation to obtain a more compact distribution.
The plot in Question 4 shows the three graphs and their respective transformations.
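
One common way to state an explicit outlier criterion is Tukey's 1.5 x IQR rule. The sketch below is only an illustration under assumptions: the `reviews` vector is made-up data standing in for `udemy$num_reviews`, and the rule itself is a suggestion, not the criterion used in this report.

```r
# Hypothetical review counts standing in for udemy$num_reviews.
reviews <- c(1, 4, 4, 18, 18, 25, 67, 70, 154, 27445)

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
q <- quantile(reviews, probs = c(0.25, 0.75), names = FALSE)
fence_low  <- q[1] - 1.5 * IQR(reviews)
fence_high <- q[2] + 1.5 * IQR(reviews)
is_outlier <- reviews < fence_low | reviews > fence_high

# Values flagged as potential outliers.
reviews[is_outlier]
```

Flagged values do not have to be dropped; they can be kept and handled with a transformation, as done here.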


3. Describe the shape and skewness of the distribution. (2 marks)

shape = Unimodal
skewness = Right Skewed as mean > median


4. Based on your answer to the previous question, decide if it is appropriate to apply a transformation to your data. If no, explain why not. If yes, name the transformation applied and visualize the transformed distribution. (This video and this video may help.) (4 marks)

a) A data transformation was clearly needed to bring the distribution closer to symmetric. This is how I transformed the data:

p1 <- ggplot(udemy, aes(x=num_reviews))

summary(udemy$num_reviews)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       4      18     154      67   27445
summary(log10(udemy$num_reviews+1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.301   0.699   1.279   1.340   1.833   4.438
summary(sqrt(udemy$num_reviews))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   4.243   7.120   8.185 165.665

b) Plot all three using the “gridExtra” package to see which distribution is closest to ideal.

p1 <- qplot(x= num_reviews, data=udemy)
p2 <- qplot(x=log10(num_reviews + 1), data = udemy)
p3 <- qplot(x= sqrt(num_reviews), data = udemy)

grid.arrange(p1,p2,p3, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

c) Of the three, the second plot (log10) is the best choice, as it shows the distribution most evenly spread out.
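
The visual comparison can be backed by a number. As a rough sketch, assuming a simple moment-based skewness (the helper `skewness()` and the `reviews` vector are hypothetical, not part of the report), a value nearer 0 indicates a more symmetric distribution:

```r
# Hypothetical right-skewed counts standing in for udemy$num_reviews.
reviews <- c(1, 2, 4, 4, 9, 18, 30, 67, 150, 900, 27445)

# Moment-based sample skewness: 0 for a symmetric sample,
# positive when the right tail is longer.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Compare the raw values with the two candidate transformations.
skew_by_scale <- sapply(
  list(raw = reviews, log10 = log10(reviews + 1), sqrt = sqrt(reviews)),
  skewness
)
skew_by_scale
```

On this made-up sample the log10 scale gives the smallest skewness, which matches the choice of the second plot.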


5. Choose and calculate an appropriate measure of central tendency. (3 marks)

udemy %>% summarise(median(log10(num_reviews + 1)))

6. Explain why you chose this as your measure of central tendency. Provide supporting evidence for your choice. (4 marks)

The median is ideal for the above chart because:

For distributions with skewness and outliers, the median is the preferred measure of central tendency because it is least affected by outliers.
As seen in the graph, the mean is pulled in the direction of the skew.
This shows that the mean is affected by skewness and outliers.


7. Choose and calculate a measure of spread that is appropriate for your chosen measure of central tendency. Explain why you chose this as your measure of spread. (2 marks)

udemy %>% summarise(IQR(log10(num_reviews + 1)))

The range gives us a measurement of how spread out the entirety of our data set is. The interquartile range, which tells us how far apart the first and third quartile are, indicates how spread out the middle 50% of our set of data is. The primary advantage of using the interquartile range rather than the range for the measurement of the spread of a data set is that the interquartile range is not sensitive to outliers.
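
The robustness argument can be checked with a small experiment. This is a minimal sketch on made-up numbers (`x_clean`, `x_extreme` and `summarise_spread()` are hypothetical names): adding one extreme value moves the mean, standard deviation and range a lot, but the median and IQR only slightly.

```r
# Hypothetical sample; the appended value plays the role of one extreme course.
x_clean   <- c(1, 4, 8, 18, 30, 67)
x_extreme <- c(x_clean, 27445)

# Centre and spread measures side by side.
summarise_spread <- function(x) {
  c(mean = mean(x), median = median(x),
    sd = sd(x), iqr = IQR(x), range = diff(range(x)))
}

comparison <- rbind(clean   = summarise_spread(x_clean),
                    extreme = summarise_spread(x_extreme))
comparison
```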

1.2 Categorical Variable

1. Create an appropriate plot to visualize the distribution of counts for this variable. (4 marks)

plot2 <- udemy %>% group_by(subject) %>% 
         summarize(count = n()) %>%
         plot_ly(labels = ~subject , values = ~count) %>%
         add_pie(hole=0.6) %>%
         layout(title = "Distribution of Count for Subjects",showlegend = F,
                xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
                yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
plot2

2. Create an appropriate plot to visualize the distribution of proportions for this variable. (4 marks)

ggplot(udemy, aes(subject, fill = subject)) + 
  # Plot each subject's share of all courses rather than the raw count
  geom_bar(aes(y = after_stat(count / sum(count)))) + coord_flip()+
  ggtitle("Distribution of proportions for Subjects") + 
  labs(x = "Types of Subjects",  y = "Proportion", fill = "Subjects")+ theme_minimal() + theme(plot.title = element_text (hjust = 0.5)) 


3. Discuss any unusual observations for this variable? (2 marks)

Looking at both graphs, there do not seem to be any unusual observations; each conveys its message clearly to the viewer.
Both graphs yield useful insights.


4. Discuss if there are too few/too many unique values? (2 marks)

The number of unique values is small enough that it does not affect the observations.
Both graphs are concise and accurate.

2. Bivariate Analysis

2.1 Plot 1

1. Create an appropriate plot to visualize the relationship between the two variables, where both are numeric. (4 marks)

ggplot(udemy, aes(price,num_lectures)) +
  geom_bin2d(bins = 20, color ="white")+
  scale_fill_gradient(low =  "#00AFBB", high = "#FC4E07")+
  ggtitle("Relationship between two numerical values") + 
  labs(x = "Price",  y = "Number of Lectures")+ theme_minimal() + theme(plot.title = element_text (hjust = 0.5)) 


2. Describe the form, direction, and strength of the observed relationship. Include both qualitative and quantitative measures, as appropriate. (4 marks)

The graph uses “geom_bin2d” as an alternative to a scatter plot.
The number of lectures is plotted against the course price.
The plot shows a moderate positive relationship: the number of lectures tends to increase as the course price increases. There are a few outliers and exceptions in the graph.


3. Explain what this relationship means in the context of the data. (4 marks)

In context, the plot shows the relationship between the two numeric variables.
Looking at the graph, the viewer can see a positive association between them: higher-priced courses tend to have more lectures.


4. Describe the variability that you observe in the plot and how that corresponds to the strength you calculated in #2 above. (3 marks)

There is some variation, and there are exceptions, in the plot. Some lower-priced courses have very high numbers of lectures, which runs against the overall trend.
There are also outliers in the plot, which make the relationship harder to read accurately.
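
A quantitative measure of strength can accompany the plot. The sketch below uses made-up `price` and `num_lectures` vectors (not the Udemy data) and shows Pearson's r alongside Spearman's rank correlation, which is less sensitive to outliers such as a single course with a very large number of lectures.

```r
# Hypothetical prices and lecture counts; not the Udemy data.
price        <- c(0, 20, 20, 45, 50, 95, 95, 150, 200, 200)
num_lectures <- c(12, 10, 30, 25, 18, 40, 22, 35, 60, 300)

# Pearson measures linear association; Spearman works on ranks,
# so the 300-lecture course has less influence on it.
pearson  <- cor(price, num_lectures)
spearman <- cor(price, num_lectures, method = "spearman")

# r squared is the share of variance explained by a straight-line fit.
c(pearson = pearson, spearman = spearman, r_squared = pearson^2)
```

On the real data the same two calls would be `cor(udemy$price, udemy$num_lectures)` with and without `method = "spearman"`.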

2.2 Plot 2

1. Create an appropriate plot to visualize the relationship between the two variables,where one variable is categorical and the other is numeric (4 marks)

ggplot(udemy, 
       aes(x = price, 
           y = level, 
           fill = level)) +
  geom_density_ridges() + 
  labs(title = "Relationship between numeric and categorical",
       x = "Prices",
       y = "Course Levels", fill = "Levels")+
  # Apply the complete theme first; adding it last would undo the theme() tweaks
  theme_minimal()+
  theme(legend.position = "none", plot.title = element_text (hjust = 0.5))
## Picking joint bandwidth of 13.5


2. Describe the form, direction, and strength of the observed relationship. Include both qualitative and quantitative measures, as appropriate. (4 marks)

The graph shows how course prices are distributed within each course level.
The density curves peak and fall at different prices for different levels, so price appears to be related to level. Because one variable is categorical, the relationship is described by comparing the shapes and peaks of the distributions rather than by a correlation coefficient.


3. Explain what this relationship means in the context of the data. (4 marks)

It shows that in the $0-50 price range there are more courses at the Intermediate and Beginner levels. Looking at the other price ranges in the same way shows the count of, and demand for, each level of course.


4. Describe the variability that you observe in the plot and how that corresponds to the strength you calculated in #2 above. (3 marks)

No variability is observed that would affect the insight drawn from the visualization.
However, the $100-150 price range could be broken down further, as the plot does not give a clear picture for all levels there.

Akshay Joshi

Univariate Analysis
Univariate plot 1

udemy_bar1 <- ggplot(udemy, aes(x = price)) 
udemy_bar1 + geom_histogram(binwidth = 20,color = "darkslategray",
                       fill = "lightblue") + ggtitle("Distribution of course prices in Udemy") + 
  labs(x = "Price", y = "Number of courses") + 
  # Reference lines use price, the variable on the x axis
  geom_vline(aes(xintercept = median(price)),color = "blue", size = 1) +
  geom_vline(aes(xintercept = mean(price)),color = "red", 
             linetype = "dashed", size = 1)


The plot above shows the distribution of course prices: each bar is the number of courses in a price bin. The histogram is right skewed, which is confirmed by the mean being higher than the median. Because the skew is not heavy, there are no obvious potential outliers in the data. The plot does not require any transformation, since the tail of the histogram is clearly visible. The blue line marks the median and the red line marks the mean.


Central Tendency - I chose the mean as my measure of central tendency since, for the given plot, the mean is greater than the median, and since the price variable contains exact values rather than approximate ones.

mean(udemy$price)
## [1] 66.08542


As the graph is right skewed, the best way to measure spread is the interquartile range.

IQR(udemy$price)
## [1] 75


The shape of the distribution is unimodal and it is right/positive skewed.


Univariate plot 2

Proportion wise

udemy_bar2.1 <- ggplot(udemy, aes(x = year, y = after_stat(prop), group = 1)) 
udemy_bar2.1 + geom_bar(color = "darkslategray", fill = "lightblue") +
  ggtitle("Courses made according to each year in proportions") + 
  labs(x = "Year", y = "Proportion")


Count wise

udemy_bar2.2 <- ggplot(udemy, aes(x = year)) 
udemy_bar2.2 + geom_bar(color = "darkslategray", fill = "lightblue") +
  ggtitle("Courses made in each year") + 
  labs(x = "Year", y = "Number of Courses made")


udemy %>% count(year)


We can observe that more and more courses were published each year up to 2016. However, the number of courses decreased in 2017 compared to 2016. I also counted the number of courses in each year to determine the exact pattern in the data. There are 7 unique values in total, which is enough to show a trend in the data.


Bivariate Analysis

Bivariate plot 1

udemy_bar3_filter <- udemy %>% 
  mutate(review_filter = num_reviews, lecture_filter = num_lectures) %>%
  filter(review_filter < 200, lecture_filter < 200)

udemy_bar3 <- ggplot(udemy_bar3_filter, aes(x = lecture_filter, y = review_filter))
udemy_bar3 + geom_point(aes(color = subject),alpha = 0.8) + 
  geom_smooth(method = "lm", linetype = "dashed")+ coord_flip() +
  ggtitle("Lectures and reviews for courses with fewer than 200 of each") + 
  labs(x = "Number of lectures", y = "Number of reviews", color = "Subjects")
## `geom_smooth()` using formula 'y ~ x'


cor(udemy_bar3_filter$lecture_filter,
    udemy_bar3_filter$review_filter)
## [1] 0.230614

The scatterplot shows a weak positive association between the number of lectures and the number of reviews: the correlation coefficient calculated above is about 0.23, which is close to 0. The plot depicts how many reviews each course has relative to its number of lectures. The dashed line is the fitted linear trend, and the points scatter widely around it, so the number of lectures alone says little about how many reviews a course receives.


Bivariate plot 2

udemy_subscribers <- udemy %>% 
  mutate(subs_filter = num_subscribers) %>%
  filter(subs_filter < 300)

udemy_bar4 <- ggplot(udemy_subscribers, aes(x = is_paid, y = log(num_subscribers + 1)))
udemy_bar4 + geom_boxplot(color="indianred",fill="sienna2") +
  ggtitle("Number of subscribers for paid and free courses") + 
  labs(x = "Paid or not", y = "log(Number of subscribers + 1)")

Among the courses shown, free courses appear to attract more subscribers than paid ones. A correlation coefficient cannot be calculated here because one of the variables is Boolean rather than numeric; the two groups are compared through their box plots instead.
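
When one variable is Boolean, a per-group summary can serve as the quantitative measure in place of a correlation. A minimal sketch, using a made-up `courses` data frame rather than the real data:

```r
# Hypothetical courses; is_paid mirrors the logical column in the data set.
courses <- data.frame(
  is_paid         = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  num_subscribers = c(10, 45, 120, 290, 150, 210, 280)
)

# Median subscribers per group; the names are "FALSE" (free) and "TRUE" (paid).
medians <- sapply(split(courses$num_subscribers, courses$is_paid), median)
medians

# Difference in medians, free minus paid.
median_gap <- medians[["FALSE"]] - medians[["TRUE"]]
median_gap
```

The gap between the two medians is what the box plot shows visually as the offset between the two middle lines.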

Nivetha Ravi

##Univariate Analysis##

##Numeric variable##

##1. Create an appropriate plot to visualize the distribution of this variable##

UM<-ggplot(udemy, aes(x = price)) +
  geom_bar( color = "orange") +
  labs(x = "Price of Each Course(In Dollars)",
       y = "Count",
       title = "Distribution of prices for each course") +
  theme_minimal()
ggplotly(UM)

##2.Consider any outliers present in the data. If present, specify the criteria used to identify them and provide a logical explanation for how you handled them##

There are no outliers in this graph, as all course prices fall between $0 and $200.

##3.Describe the shape and skewness of the distribution##

The distribution is unimodal and highly right-skewed.

##4.Based on your answer to the previous question, decide if it is appropriate to apply a transformation to your data. If no, explain why not. If yes , name the transformation applied and visualize the transformed distribution##

Although the distribution is right-skewed, the prices stay within a bounded range ($0 to $200), so there is no need to transform the data: the individual data points remain clearly visible in the visualization.

A transformation might make the visualization look better, but it would be harder to interpret, because log-transformed prices are no longer in dollars and the actual data points and trends could not be read directly from the plot.

##5. Choose and calculate an appropriate measure of central tendency##

As the data distribution is skewed, the median is the recommended measure of central tendency.

median(udemy$price)
## [1] 45

##6.Explain why you chose this as your measure of central tendency. Provide supporting evidence for your choice##

avg_price = mean(udemy$price)
median_price = median(udemy$price)
ggplot(udemy, aes(x = price)) +
  geom_bar(stat = "bin", fill = "steelblue") +
  labs(x = "Price of each course",
       y = "Count",
       title = "Distribution of price for each course") +
  theme_minimal() +
  geom_vline(xintercept = avg_price,
             color = 'red',
             size = 1.5) +
  geom_vline(xintercept = median_price,
             color = 'blue',
             size = 1.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

udemy %>%
  summarise(
    mean = mean(price),
    median = median(price),
    std_dev = sd(price),
    IQR = IQR(price)
  )

Both the plot and the statistics show that the distribution is right-skewed and that the mean is higher than the median for this variable. For variables with such behaviour, the median is more appropriate than the mean.

##7.Choose and calculate a measure of spread that is appropriate for your chosen measure of central tendency. Explain why you chose this as your measure of spread.

IQR(udemy$price)
## [1] 75

The IQR from the above calculation is 75. The distribution is right-skewed, with the mean higher than the median. For variables with such behaviour, the IQR is a more appropriate measure of spread than the standard deviation.

##CATEGORICAL VARIABLE##

##1. Create an appropriate plot to visualize the distribution of counts for this variable##

ggplot(udemy, aes(y = level, fill = factor(level)))+geom_bar()+labs(x = "Count",
       y = "Levels",title = "Distribution of counts for each level"  )

##2.Create an appropriate plot to visualize the distribution of proportions for this variable.

ggplot(udemy , aes(y = level , x = after_stat(prop), group = 1)) + geom_bar(fill = 'indianred' , colour = 'black')+labs(x = "Proportions",
       y = "Levels",title = "Distribution of proportion for each level")

##3.Discuss any unusual observations for this variable?

udemy %>% group_by(level) %>% summarise(n=n()) %>% mutate(prop=n/sum(n))

One unusual observation, as seen above, is the imbalance between levels: All Levels has 1927 courses, while Expert Level has only 57, the lowest among the levels.
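
The same counts and proportions can also be obtained in base R. A small sketch on a made-up `level` vector (the values are illustrative, not the real counts):

```r
# Hypothetical course levels standing in for udemy$level.
level <- c("All Levels", "All Levels", "All Levels", "Beginner Level",
           "Beginner Level", "Intermediate Level", "Expert Level", "All Levels")

# table() counts each category; prop.table() converts counts to proportions.
counts <- table(level)
props  <- prop.table(counts)

counts
round(props, 3)
```

A category with a very small proportion, like Expert Level in the real data, stands out immediately in this view.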

##4.Discuss if there are too few/too many unique values?

unique(udemy$level)
## [1] "All Levels"         "Intermediate Level" "Beginner Level"    
## [4] "Expert Level"

There are four unique values, which is neither too few nor too many for a categorical variable.

##BIVARIATE ANALYSIS

##TWO NUMERIC VARIABLE##

##1.Create an appropriate plot to visualize the relationship between the two variables##

ggplot(udemy,aes(price,num_lectures))+geom_quantile(colour = "red")+geom_smooth(colour = "black")+
  labs(x= "Price" , y= "Number of Lectures",title = "Price for Each Number of Lectures")
## Smoothing formula not specified. Using: y ~ x
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

##2.Describe the form, direction, and strength of the observed relationship. Include both qualitative and quantitative measures, as appropriate##

cor(udemy$price , udemy$num_lectures)
## [1] 0.3302212

From the correlation coefficient (about 0.33) and the plot, there is a weak-to-moderate positive correlation between num_lectures and price.

##3.Explain what this relationship means in the context of the data##

From the above plot, we can see that the price gradually increases as the number of lectures increases.

##4.Describe the variability that you observe in the plot and how that corresponds to the strength you calculated in #2 above##

The variability I can see here is that, although the price increases with the number of lectures, there are also free courses that still contain lectures. Also, between $100 and $150 the number of lectures decreases before increasing again. This variation weakens the relationship between price and number of lectures, which is consistent with the modest correlation calculated above.

##ONE NUMERIC AND ONE CATEGORICAL VARIABLE##

##1.Create an appropriate plot to visualize the relationship between the two variables##

UM<-ggplot(udemy,aes(x = year, y = price , color = factor(year)))+geom_jitter()+labs(x = "Year",
       y = "Price",title = "Relationship between Price and Year")
ggplotly(UM)

##2.Describe the form, direction, and strength of the observed relationship. Include both qualitative and quantitative measures , as appropriate##

There is no clear relationship between these two variables. Because year is treated as a categorical variable here, quantitative measures such as correlation and covariance are not appropriate for this pair.

##3.Explain what this relationship means in the context of the data##

From the above jitter plot, prices appear to increase as the years go by.

##4.Describe the variability that you observe in the plot and how that corresponds to the strength you calculated in #2 above##

We cannot quantify the variability with a covariance between ‘price’ and ‘year’, as one variable is numeric and the other is treated as categorical.
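
Although a correlation is not suitable here, the strength of a numeric-by-categorical relationship can still be quantified. One option, offered as a suggestion rather than as the method used in this report, is eta squared: the share of the numeric variable's total variation that lies between the group means. The `price` and `year` values below are made up.

```r
# Hypothetical prices grouped by year; not the Udemy data.
price <- c(20, 25, 30, 40, 50, 60, 90, 100, 110)
year  <- factor(c(2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017, 2017))

# Group sizes and means, both in the order of the factor levels.
groups      <- split(price, year)
group_n     <- lengths(groups)
group_means <- sapply(groups, mean)
grand_mean  <- mean(price)

# Eta squared = between-group sum of squares / total sum of squares.
ss_between  <- sum(group_n * (group_means - grand_mean)^2)
ss_total    <- sum((price - grand_mean)^2)
eta_squared <- ss_between / ss_total
eta_squared
```

Values near 0 mean the groups barely differ; values near 1 mean most of the variation is explained by the grouping.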

Yash Bhavsar

1) Univariate Analysis

1.1) Univariate Analysis for numeric variable:-

  1. Create an appropriate plot to visualize the distribution of numeric variable.
ggplot(udemy, aes(x = price,fill = level))+
  geom_density( alpha = 0.5)+
  facet_wrap(~level)+
  geom_vline(xintercept = mean(udemy$price), colour = 'red',size = 1)+
  geom_vline(xintercept = median(udemy$price), colour = 'Blue', size = 1)+
  labs( y = "Density", x = "Price of Course", fill = "Level of Subject", title= "Distribution of Numeric Variable")

  1. No, there are no outliers in my data, as the prices are spread similarly across all the course levels.

  2. The distribution is right skewed, as it has a long tail extending towards the right. For some levels the distribution also appears bimodal.

  3. No, we do not need to apply any transformation: with a density plot the distribution is clearly visible. If we used a histogram, a transformation might be needed.

summary(udemy$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   20.00   45.00   66.09   95.00  200.00
  1. Mean = 66.09, median = 45.00. The median is selected as the measure of central tendency.

  2. As we can see in the plots, the median is less than the mean, and half of the data lies on either side of the median. The median is therefore preferred as the measure of central tendency (and for filling in null values), as our data is skewed rather than symmetric.

sd(udemy$price)
## [1] 61.00289
IQR(udemy$price)
## [1] 75
var(udemy$price)
## [1] 3721.352
  1. As the distribution is right skewed, we will use the IQR as the measure of spread, which is appropriate for the chosen measure of central tendency.

1.2) Univariate Analysis for categorical variable:-

  1. Create an appropriate plot to visualize the distribution of counts for Categorical variable.
ggplot(udemy, aes(x = subject, fill = is_paid))+
  geom_bar()+
  labs( y = "Number of Courses", x = "Subjects", fill = 'Subject is Paid',title= "Distribution of Counts")

  1. Create an appropriate plot to visualize the distribution of proportions for Categorical variable.
ggplot(udemy, aes(x = subject,y = after_stat(prop),group = 1))+
  geom_bar(colour = 'black',fill = "pink")+
  labs( y = "Proportion of Courses", x = "Subjects",title= "Distribution of Categorical Variable")

  1. The count distribution shows that most courses are paid, and that the plot has two peaks: two subjects have noticeably more courses than the others.

  2. There are very few unique values, and the data is fairly evenly distributed across them.

2) Bivariate Analysis

2.1) pair of variables where both are numeric.

  1. Create an appropriate plot to visualize the relationship between the two variables.
ggplot(udemy, aes (x = price, y = content_duration,fill = is_paid, colour = is_paid))+
  geom_point(alpha = 0.3)+
  geom_smooth(method = 'lm')+
  labs( y = "Duration of Content", x = "Price",fill = 'Subject is Paid',colour = 'Subject is Paid',title= "Distribution of Content Vs Price")
## `geom_smooth()` using formula 'y ~ x'

cor(udemy$price,udemy$content_duration)
## [1] 0.2938713
  1. The relationship is roughly linear: as the price increases, the content duration also increases. The correlation coefficient (about 0.29) shows that the relationship between price and content duration is on the weaker side, as the plot also suggests.

  2. Free courses have at most about 20 hours of content, and duration gradually increases as price increases. However, price and duration are not strongly correlated, because some courses are priced much higher yet have a short content duration, as they are at expert level.

  3. The variability I observed corresponds to a weak positive linear relationship, consistent with the low correlation of 0.29 calculated above.

2.2) Pair of variables where one variable is categorical and the other is numeric.

  1. Create an appropriate plot to visualize the relationship between the two variables.
ggplot(udemy, aes (y = price, x = subject, fill = subject))+
  geom_boxplot()+
  labs( y = "Cost of Subjects", x = "Subjects",fill = 'Subjects',title= "Subjects Vs Price")

  1. There is no simple linear pattern, as the cost is not constant across subjects and the median cost also varies by subject. There are more potential outliers for Musical Instruments and Graphic Design than for Business Finance and Web Development. Counting the outliers, the highest-priced courses appear in every subject.

  2. The most popular subjects are Business Finance and Web Development, as they have courses across the whole range of prices and levels. There is no steady growth from one box plot to the next, but the medians of all subjects sit at a similar price.

  3. As we cannot calculate a correlation coefficient for a numeric and a categorical variable, we can only inspect the box plot. The boxes follow a rough U shape across subjects, while the medians are close together, with the median price of every subject around 50.

Bhupesh Reddy Challa

Loading libraries

library(tidyverse)
library(dplyr)
library(here)
library(ggplot2)
library(quantreg)

Loading data

uc_df <- read.csv("Data/udemy_courses.csv")

head(uc_df)

Viewing data

glimpse(uc_df)
## Rows: 3,676
## Columns: 12
## $ course_id        <int> 1070968, 1113822, 1006314, 1210588, 1011058, 19287...
## $ course_title     <chr> "Ultimate Investment Banking Course", "Complete GS...
## $ url              <chr> "https://www.udemy.com/ultimate-investment-banking...
## $ is_paid          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
## $ price            <int> 200, 75, 45, 95, 200, 150, 65, 95, 195, 200, 200, ...
## $ num_subscribers  <int> 2147, 2792, 2174, 2451, 1276, 9221, 1540, 2917, 51...
## $ num_reviews      <int> 23, 923, 74, 11, 45, 138, 178, 148, 34, 14, 93, 42...
## $ num_lectures     <int> 51, 274, 51, 36, 26, 25, 26, 23, 38, 15, 76, 17, 1...
## $ level            <chr> "All Levels", "All Levels", "Intermediate Level", ...
## $ content_duration <dbl> 1.5000000, 39.0000000, 2.5000000, 3.0000000, 2.000...
## $ year             <int> 2017, 2017, 2016, 2017, 2016, 2014, 2016, 2015, 20...
## $ subject          <chr> "Business Finance", "Business Finance", "Business ...

Data mutated below added additional variables.

UNIVARIATE 1 NUMERIC DATA

summary(uc_df$num_reviews)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       4      18     154      67   27445

Number of reviews

  1. Create an appropriate plot to visualize the distribution of this variable.

p <- ggplot(uc_df,mapping = aes(num_reviews)) + geom_histogram(bins = 30)+
  labs(x="Reviews", y="Count",title = "Number of reviews")
 p

  1. Consider any outliers present in the data. If present, specify the criteria used to identify them and provide a logical explanation for how you handled them.

  • The data is extremely skewed: a few courses have very large review counts, but these can be considered correct data.

  • These review counts should not be removed as outliers, because the number of reviews carries much of the information about opinion of the course.

  • In order to extract more information from the graph, we need to transform it.

  • Little can be learned from the untransformed histogram.

  1. Describe the shape and skewness of the distribution.

  • It is unimodal and right skewed.

  1. Based on your answer to the previous question, decide if it is appropriate to apply a transformation to your data. If no, explain why not. If yes, name the transformation applied and visualize the transformed distribution.

  • Yes, the data needs to be transformed to reveal more information in the plot.

  • The applied transformation is log10, which spreads the data over a broader, more readable range.

Below plot shows the transformed data.

c <- ggplot(uc_df,mapping = aes(log10(num_reviews+1))) + geom_histogram(bins = 30L ,alpha=0.6)

c + theme_grey()+labs(x="Reviews", y="Count",title = "Number of reviews")

  1. Choose and calculate an appropriate measure of central tendency.

# Use distinct names so the base functions mean() and median() are not shadowed
mean_log <- mean(log10(uc_df$num_reviews+1))

median_log <- median(log10(uc_df$num_reviews+1)) 

diff_mean_median <- mean_log - median_log

diff_mean_median
## [1] 0.06125395

  • As the difference between the mean and median is small, we can say that the histogram is approximately symmetric.

c <- ggplot(uc_df,mapping = aes(log10(num_reviews+1))) + geom_histogram(bins = 30L ,alpha=0.6)


cv <- c+ geom_vline(aes(xintercept=mean(log10(num_reviews+1))), color="1", linetype="dashed", size=1) +   
  geom_vline(aes(xintercept=median(log10(num_reviews+1))), color="6", linetype="dashed", size=1)

cv + theme_grey()+labs(x="Reviews(log10)", y="Count",title = "Number of reviews")

  1. Explain why you chose this as your measure of central tendency. Provide supporting evidence for your choice.

  • The mean was chosen as the measure of central tendency because it uses every observation in the data, and for an approximately symmetric distribution the mean is the preferred measure.

  1. Choose and calculate a measure of spread that is appropriate for your chosen measure of central tendency. Explain why you chose this as your measure of spread. (2 marks)

std_num_reviews <- sd(log10(uc_df$num_reviews+1))

std_num_reviews
## [1] 0.7563908

  • Because the transformed histogram is approximately symmetric, the standard deviation is used as the measure of spread.
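One possible way to put the mean-median gap in perspective is to express it relative to the standard deviation, for example with Pearson's second skewness coefficient, 3 × (mean − median) / sd. This is an added illustration rather than a measure used in the report; the two inputs are the values printed in the output above.

```r
# Values reported in the output above (log10 scale)
mean_minus_median <- 0.06125395
sd_log_reviews <- 0.7563908

# Gap between mean and median, in units of one standard deviation
gap_in_sd <- mean_minus_median / sd_log_reviews

# Pearson's second skewness coefficient (an assumption of this sketch)
pearson_skew2 <- 3 * mean_minus_median / sd_log_reviews

round(gap_in_sd, 2)
round(pearson_skew2, 2)
```

The gap is under a tenth of a standard deviation, which is consistent with treating the transformed distribution as roughly symmetric.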

UNIVARIATE 1 CATEGORICAL VARIABLE

  1. Create an appropriate plot to visualize the distribution of counts for this variable.

ggplot(uc_df, mapping = aes(is_paid,fill=is_paid))+geom_bar()+
  labs(x="Is Paid", y="Count",title = "Number of Paid")+
  theme_minimal()

  1. Create an appropriate plot to visualize the distribution of proportions for this variable.

uc_df_mt <- uc_df %>% group_by(is_paid)

ggplot(uc_df_mt, mapping = aes(factor(is_paid), y = ..prop.., group = 1)) +
  geom_bar(stat = "count", fill = "blue", alpha = 0.4) +
  labs(x = "Is Paid", y = "Proportions", title = "Proportions of Paid Values") +
  # Round the label so the proportion is readable above each bar
  geom_text(aes(label = round(..prop.., 2)), stat = "count", position = position_dodge(0.9), vjust = -0.5)

  • The proportions of paid and free courses are shown in the graph above.
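The same proportions can also be checked numerically. The sketch below uses a made-up logical vector (`is_paid` here is hypothetical sample data, not the course data) to show how counts and proportions could be tabulated in base R:

```r
# Hypothetical paid/free flags, for illustration only
is_paid <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

# Counts per category
counts <- table(is_paid)

# Proportions per category (sums to 1)
props <- prop.table(counts)

counts
round(props, 2)
```

A table like this makes it easy to report the exact share of each category alongside the bar chart.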

  1. Discuss any unusual observations for this variable?

  • This variable has no unusual observations. It takes only the values TRUE and FALSE, and the plot shows the proportion of courses in each category.

  1. Discuss if there are too few/too many unique values?

  • There are not too many unique values: the variable has only two, True for paid courses and False for free courses.

BIVARIATE ANALYSIS

  1. Create an appropriate plot to visualize the relationship between the two variables.

Relationship between a pair of numeric variables.

The two numeric variables considered are “sales” and “num_lectures”. The sales variable is derived as the price multiplied by the number of subscribers for each course.

The new variable sales is created with the mutate function.

# sales = price x number of subscribers for each course
uc_df_mut_sales <- uc_df %>% mutate(sales = price * num_subscribers)
ggplot(uc_df_mut_sales, aes(log10(sales + 1), log10(num_lectures + 1))) +
  geom_point(alpha = 0.4) + geom_smooth(color = "red") +
  labs(x = "Sales (log10 scale)", y = "Number of lectures (log10 scale)", title = "The number of lectures for sales")

  • Because the raw values are too tightly clustered to read, both sales and the number of lectures are plotted on a log10 scale so that the scatter plot shows the trend more clearly.

  1. Describe the form, direction, and strength of the observed relationship. Include both qualitative and quantitative measures, as appropriate.

Qualitative

The following details are observed from the plot.

  • Form is almost linear.

  • Direction is positive.

  • Strength is very weak.

The plot also contains many outliers.

Quantitative

Correlation between the two variables (sales and number of lectures):

cor(uc_df_mut_sales$sales,uc_df_mut_sales$num_lectures)
## [1] 0.3218139
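Note that the correlation above is computed on the raw values, while the scatter plot uses log10 scales. The sketch below, on simulated data (the `lectures` and `sales` vectors are generated here and are not the course data), shows how the Pearson correlation on raw values, the Pearson correlation on log10 values, and the rank-based Spearman correlation could be compared:

```r
set.seed(1)

# Simulated, right-skewed data for illustration only
lectures <- round(exp(rnorm(200, mean = 3, sd = 1)))
sales <- lectures * exp(rnorm(200, mean = 5, sd = 1.5))

# Pearson correlation on the raw scale (sensitive to extreme values)
pearson_raw <- cor(sales, lectures)

# Pearson correlation on the log10 scale used in the plot
pearson_log <- cor(log10(sales + 1), log10(lectures + 1))

# Spearman correlation depends only on ranks, so it is the same on both scales
spearman_raw <- cor(sales, lectures, method = "spearman")
spearman_log <- cor(log10(sales + 1), log10(lectures + 1), method = "spearman")

c(pearson_raw = pearson_raw, pearson_log = pearson_log, spearman = spearman_raw)
```

Reporting the coefficient on the same scale as the plot keeps the qualitative and quantitative descriptions directly comparable.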

  1. Explain what this relationship means in the context of the data. (4 marks)

  • The relationship between the two variables is roughly linear. Free courses (zero sales) have lecture counts ranging from about 100 to 200, while paid courses follow a different trend: among courses with higher sales, the majority have 100 to 200 lectures. Furthermore, once sales exceed 4 on the log10 scale (roughly 10,000), the lecture count increases.

  • The pattern is not clear-cut, however: the data points are widely scattered rather than lying close to the smooth line.

  • The form can be described as linear because the curve is increasing, i.e. as the number of lectures increases, sales tend to increase.

  • As the plot shows, the relationship is very weak: the points are scattered and most of them lie far from the smooth curve.

    1. Describe the variability that you observe in the plot and how that corresponds to the strength you calculated in #2 above.

    • As the plot shows, the relationship between the variables is very weak. The points are widely scattered, and most of them lie far from the smooth curve; this broad spread around the line is what makes the relationship weak.

    • The correlation coefficient tells the same story: the correlation between the two variables is about 0.32, which is low.

    • The direction is positive both qualitatively and quantitatively.

    One numeric and one categorical variable.

    1. Create an appropriate plot to visualize the relationship between the two variables.

    Plot of the relationship between sales (numeric) and level (categorical).

    # Boxplot of log10 sales per level; the red point marks each level's mean
    ggplot(uc_df_mut_sales, aes(log10(sales + 1), level)) +
      geom_boxplot(aes(fill = level), outlier.shape = 9) +
      stat_summary(fun = base::mean, geom = "point", color = "red", size = 4) + theme_bw() +
      labs(x = "Total sales (log10 scale)", y = "Levels", title = "Sales for each level")

    • The All Levels category has the highest median of all the levels, and the Beginner level has the lowest.

    • Based on the boxplots, the data for every level are left-skewed.

    • Outliers are drawn with a distinct marker shape. They occur in the All Levels and Intermediate Level categories.

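    # Median and mean of sales on the log10 scale
    # (note: this median call omits the +1 offset that the mean call uses)
    median(log10(uc_df_mut_sales$sales))
    ## [1] 4.363424
    mean(log10(uc_df_mut_sales$sales+1))
    ## [1] 3.926521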

    1. Describe the form, direction, and strength of the observed relationship. Include both qualitative and quantitative measures, as appropriate.

    Quantitative measures

    median_sales <- median(log10(uc_df_mut_sales$sales+1))
    mean_sales <- mean(log10(uc_df_mut_sales$sales+1))
    
    
    mean_sales-median_sales
    ## [1] -0.4369218

    The median is greater than the mean, which indicates that the distribution is left-skewed.

    Qualitative measures

    • The boxplot shows the median and mean for each level, which helps in judging skewness. The median is greater than the mean.

    • The Beginner level has the highest number of sales, and the Expert level has the lowest.

    • Outliers exist for the All Levels and Intermediate Level courses.
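The per-level comparison of mean and median can also be tabulated. The sketch below uses a tiny made-up data frame (the `courses` values are hypothetical, not the course data) to show one way of computing both statistics for each level in base R:

```r
# Hypothetical sales per level, for illustration only
courses <- data.frame(
  level = c("Beginner", "Beginner", "Beginner", "Expert", "Expert", "Expert"),
  sales = c(0, 900, 99999, 99, 999, 9999)
)
courses$log_sales <- log10(courses$sales + 1)

# Mean and median of log10 sales within each level
means <- tapply(courses$log_sales, courses$level, mean)
medians <- tapply(courses$log_sales, courses$level, median)

summary_by_level <- data.frame(
  level = names(means),
  mean = as.numeric(means),
  median = as.numeric(medians)
)

# A negative difference suggests left skew within that level
summary_by_level$mean_minus_median <- summary_by_level$mean - summary_by_level$median
summary_by_level
```

A table of this kind lets each level's skewness be read off directly instead of being inferred from a single overall mean and median.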

    1. Explain what this relationship means in the context of the data.

    • From the graph, the median of total sales is greater than the mean for every level.

    • The Beginner level has the most data values, followed by the Intermediate level. For the Expert level, however, the mean and median are close to each other, which suggests that the distribution for Expert-level courses is approximately symmetric.

    1. Describe the variability that you observe in the plot and how that corresponds to the strength you calculated in #2 above.

    • The relationship between a categorical and a numeric variable cannot be summarized with a correlation coefficient. It can instead be assessed with tests such as the t-test, z-test, or ANOVA.

    • However, the mean and median values shown in the plot agree with the computed quantitative values, so there is no discrepancy between the plot and the calculations.
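As a rough illustration of the ANOVA approach mentioned above, the sketch below fits a one-way ANOVA to simulated data. The `courses` data frame and its group means are invented for this example and do not reflect the Udemy data; a Kruskal-Wallis test is included as a rank-based alternative for skewed data.

```r
set.seed(42)

# Simulated log10 sales for four levels (illustrative values only)
courses <- data.frame(
  level = factor(rep(c("All Levels", "Beginner", "Intermediate", "Expert"), each = 30)),
  log_sales = c(rnorm(30, 4.4, 1), rnorm(30, 4.0, 1),
                rnorm(30, 4.2, 1), rnorm(30, 4.1, 1))
)

# One-way ANOVA: do mean log10 sales differ between levels?
fit <- aov(log_sales ~ level, data = courses)
anova_table <- summary(fit)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]

# Rank-based alternative that does not assume normality
kw <- kruskal.test(log_sales ~ level, data = courses)

anova_table
kw$p.value
```

A small p-value would indicate that at least one level differs in its typical sales; it would not say which one, so a follow-up comparison would still be needed.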

                                               THE END